Many machine learning problems encode their data as a matrix with a possibly very large number of rows and columns. In several applications like neuroscience, image compression or deep reinforcement learning, the principal subspace of such a matrix provides a useful, low-dimensional representation of individual data. Here, we are interested in determining the $d$-dimensional principal subspace of a given matrix from sample entries, i.e. from small random submatrices. Although a number of sample-based methods exist for this problem (e.g. Oja's rule \citep{oja1982simplified}), these assume access to full columns of the matrix or particular matrix structure such as symmetry and cannot be combined as-is with neural networks \citep{baldi1989neural}. In this paper, we derive an algorithm that learns a principal subspace from sample entries, can be applied when the approximate subspace is represented by a neural network, and hence can be scaled to datasets with an effectively infinite number of rows and columns. Our method consists in defining a loss function whose minimizer is the desired principal subspace, and constructing a gradient estimate of this loss whose bias can be controlled. We complement our theoretical analysis with a series of experiments on synthetic matrices, the MNIST dataset \citep{lecun2010mnist} and the reinforcement learning domain PuddleWorld \citep{sutton1995generalization} demonstrating the usefulness of our approach.
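As a rough illustration of what learning a subspace from sample entries can look like, the sketch below takes stochastic gradient steps on a generic entry-sampled low-rank factorization loss. The function name, the squared-error objective, and the plain SGD update are illustrative assumptions and differ from the paper's actual loss and bias-controlled gradient estimator; in the neural-network setting described above, the rows of $U$ and $V$ would instead be produced by a network evaluated at row and column indices or features.

```python
# Illustrative only: a generic entry-sampled low-rank objective, not the paper's loss.
# M is accessed one entry at a time; U (n x d) and V (m x d) parameterize a rank-d
# approximation M_ij ~ U[i] @ V[j], whose column space relates to the principal subspace.
import numpy as np

def sgd_entry_step(U, V, i, j, M_ij, lr=1e-2):
    """One stochastic step on 0.5 * (U[i] @ V[j] - M_ij)**2 for a sampled entry (i, j)."""
    err = U[i] @ V[j] - M_ij
    grad_Ui, grad_Vj = err * V[j], err * U[i]   # gradients w.r.t. the sampled row and column factors
    U[i] -= lr * grad_Ui
    V[j] -= lr * grad_Vj
    return 0.5 * err ** 2
```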
We study the multi-step off-policy learning approach to distributional reinforcement learning. Despite the apparent similarity between value-based RL and distributional RL, our study reveals intriguing and fundamental differences between the two cases in the multi-step setting. We identify a novel notion of path-dependent distributional TD error, which is indispensable for principled multi-step distributional RL. The distinction from the value-based case has important implications for concepts such as backward-view algorithms. Our work provides the first theoretical guarantees for multi-step off-policy distributional RL algorithms, including results that apply to the existing approaches to multi-step distributional RL. In addition, we derive a novel algorithm, Quantile Regression-Retrace, which leads to a deep RL agent, QR-DQN-Retrace, that shows empirical improvements over QR-DQN on the Atari-57 benchmark. Collectively, we shed light on how the unique challenges of multi-step distributional RL can be addressed both in theory and in practice.
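For reference, the standard value-based Retrace operator of Munos et al. (2016), which the name QR-DQN-Retrace alludes to, is
\[
\mathcal{R}Q(x,a) \;=\; Q(x,a) \;+\; \mathbb{E}_{\mu}\!\left[\sum_{t\ge 0}\gamma^{t}\Big(\prod_{s=1}^{t} c_s\Big)\,\delta_t\right],
\qquad
c_s = \lambda\,\min\!\Big(1,\tfrac{\pi(a_s\mid x_s)}{\mu(a_s\mid x_s)}\Big),
\quad
\delta_t = r_t + \gamma\,\mathbb{E}_{\pi}Q(x_{t+1},\cdot) - Q(x_t,a_t).
\]
The distributional version discussed in the abstract replaces the scalar TD error $\delta_t$ with the path-dependent distributional TD error it introduces, whose exact form is not reproduced here.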
Continuous-time reinforcement learning offers an appealing formalism for describing control problems in which the passage of time is not naturally divided into discrete increments. Here we consider the problem of predicting the distribution of returns obtained by an agent interacting in a continuous-time stochastic environment. Accurate return predictions have proven useful for determining optimal policies for risk-sensitive control, learning state representations, multi-agent coordination, and more. We begin by establishing the distributional analogue of the Hamilton-Jacobi-Bellman (HJB) equation for diffusions and the broader class of Feller-Dynkin processes. We then specialize this equation to the setting in which the return distribution is approximated by $N$ uniformly weighted particles, a common design choice in distributional algorithms. Our derivation highlights additional terms due to statistical diffusivity which arise from the proper handling of distributions in the continuous-time setting. Based on this, we propose a tractable algorithm for approximately solving the distributional HJB based on a JKO scheme, which can be implemented in an online control algorithm. We demonstrate the effectiveness of such an algorithm in a synthetic control problem.
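The $N$ uniformly weighted particle representation referred to above is the standard one from distributional RL: the return distribution at state $x$ is approximated by an even mixture of Diracs,
\[
\eta(x) \;\approx\; \frac{1}{N}\sum_{i=1}^{N}\delta_{\theta_i(x)},
\]
with the particle locations $\theta_i(x)$ as the learnable parameters; the additional statistical-diffusivity terms mentioned in the abstract appear when this representation is carried over to continuous time.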
Deep reinforcement learning (RL) algorithms are predominantly evaluated by comparing their relative performance on a large suite of tasks. Most published results on deep RL benchmarks compare point estimates of aggregate performance, such as mean and median scores across tasks, ignoring the statistical uncertainty implied by the use of a finite number of training runs. Beginning with the Arcade Learning Environment (ALE), the shift towards computationally demanding benchmarks has led to the practice of evaluating only a small number of runs per task, exacerbating the statistical uncertainty in point estimates. In this paper, we argue that reliable evaluation in the few-run deep RL regime cannot ignore the uncertainty in results without risking slowing down progress in the field. We illustrate this point using a case study on the Atari 100k benchmark, where we find substantial discrepancies between conclusions drawn from point estimates alone and those from a more thorough statistical analysis. With the aim of increasing the field's confidence in results reported with a handful of runs, we advocate for reporting interval estimates of aggregate performance and propose performance profiles to account for the variability in results, as well as more robust and efficient aggregate metrics, such as interquartile mean scores, that achieve small uncertainty in results. Using such statistical tools, we scrutinize performance evaluations of existing algorithms on other widely used RL benchmarks including the ALE, Procgen, and the DeepMind Control Suite, again revealing discrepancies in prior comparisons. Our findings call for a change in how we evaluate performance in deep RL, for which we present a more rigorous evaluation methodology, accompanied by an open-source library, to prevent unreliable results from stagnating the field.
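As a concrete example of the recommended tools, the sketch below computes the interquartile mean (IQM) of normalized scores and a percentile-bootstrap confidence interval over runs. It is a simplified stand-in for the open-source library released with this work (rliable, to the best of my knowledge), which additionally stratifies the bootstrap over tasks; the function names here are my own.

```python
import numpy as np

def iqm(scores):
    """Interquartile mean: mean of the middle ~50% of scores (robust to outlier runs)."""
    s = np.sort(np.asarray(scores, dtype=float).ravel())
    n = len(s)
    return s[n // 4 : n - n // 4].mean()

def bootstrap_ci(score_matrix, stat=iqm, reps=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for an aggregate statistic; score_matrix is runs x tasks."""
    rng = np.random.default_rng(seed)
    n_runs = score_matrix.shape[0]
    stats = [stat(score_matrix[rng.integers(0, n_runs, n_runs)]) for _ in range(reps)]
    return tuple(np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)]))
```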
In reinforcement learning an agent interacts with the environment by taking actions and observing the next state and reward. When sampled probabilistically, these state transitions, rewards, and actions can all induce randomness in the observed long-term return. Traditionally, reinforcement learning algorithms average over this randomness to estimate the value function. In this paper, we build on recent work advocating a distributional approach to reinforcement learning in which the distribution over returns is modeled explicitly instead of only estimating the mean. That is, we examine methods of learning the value distribution instead of the value function. We give results that close a number of gaps between the theoretical and algorithmic results given by Bellemare, Dabney, and Munos (2017). First, we extend existing results to the approximate distribution setting. Second, we present a novel distributional reinforcement learning algorithm consistent with our theoretical formulation. Finally, we evaluate this new algorithm on the Atari 2600 games, observing that it significantly outperforms many of the recent improvements on DQN, including the related distributional algorithm C51.
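One common way such a value distribution is parameterized and trained in this line of work is with a fixed set of quantiles fit by an asymmetric Huber loss. The sketch below is a minimal version of that idea; the normalization and hyperparameters are simplified rather than taken from the paper.

```python
import numpy as np

def quantile_huber_loss(theta, targets, kappa=1.0):
    """theta: (N,) predicted quantile locations at midpoints tau_i = (2i+1)/(2N);
    targets: (M,) samples from the Bellman target, e.g. r + gamma * theta_next."""
    theta = np.asarray(theta, dtype=float)
    targets = np.asarray(targets, dtype=float)
    N = len(theta)
    tau = (np.arange(N) + 0.5) / N
    u = targets[None, :] - theta[:, None]                   # pairwise TD errors, shape (N, M)
    huber = np.where(np.abs(u) <= kappa,
                     0.5 * u ** 2,
                     kappa * (np.abs(u) - 0.5 * kappa))
    weight = np.abs(tau[:, None] - (u < 0).astype(float))   # penalize over-/under-estimation asymmetrically
    return (weight * huber / kappa).mean()
```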
In this paper we argue for the fundamental importance of the value distribution: the distribution of the random return received by a reinforcement learning agent. This is in contrast to the common approach to reinforcement learning which models the expectation of this return, or value. Although there is an established body of literature studying the value distribution, thus far it has always been used for a specific purpose such as implementing risk-aware behaviour. We begin with theoretical results in both the policy evaluation and control settings, exposing a significant distributional instability in the latter. We then use the distributional perspective to design a new algorithm which applies Bellman's equation to the learning of approximate value distributions. We evaluate our algorithm using the suite of games from the Arcade Learning Environment. We obtain both state-of-the-art results and anecdotal evidence demonstrating the importance of the value distribution in approximate reinforcement learning. Finally, we combine theoretical and empirical evidence to highlight the ways in which the value distribution impacts learning in the approximate setting.
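A concrete instance of applying Bellman's equation to approximate value distributions is the categorical projection used by C51-style agents: the target distribution's atoms are shifted by the reward, scaled by the discount, and redistributed onto a fixed support. The sketch below shows this projection for a single transition; it reflects the commonly published form of the algorithm rather than any specific codebase.

```python
import numpy as np

def project_target(p_next, r, gamma, z):
    """p_next: (K,) next-state probabilities over the fixed, evenly spaced support z (K atoms).
    Returns the projection of the distribution of r + gamma * Z onto the same support."""
    v_min, v_max, dz = z[0], z[-1], z[1] - z[0]
    tz = np.clip(r + gamma * z, v_min, v_max)        # shifted and scaled atom locations
    b = (tz - v_min) / dz                            # fractional index of each shifted atom
    lo, hi = np.floor(b).astype(int), np.ceil(b).astype(int)
    m = np.zeros(len(z))
    np.add.at(m, lo, p_next * (hi - b))              # split mass between neighbouring atoms
    np.add.at(m, hi, p_next * (b - lo))
    np.add.at(m, lo, p_next * (lo == hi))            # atoms that land exactly on a grid point
    return m
```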
We consider an agent's uncertainty about its environment and the problem of generalizing this uncertainty across states. Specifically, we focus on the problem of exploration in non-tabular reinforcement learning. Drawing inspiration from the intrinsic motivation literature, we use density models to measure uncertainty, and propose a novel algorithm for deriving a pseudo-count from an arbitrary density model. This technique enables us to generalize count-based exploration algorithms to the non-tabular case. We apply our ideas to Atari 2600 games, providing sensible pseudo-counts from raw pixels. We transform these pseudo-counts into exploration bonuses and obtain significantly improved exploration in a number of hard games, including the infamously difficult MONTEZUMA'S REVENGE.
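The pseudo-count construction referred to above has a simple closed form: if $\rho_n(x)$ is the density model's probability of $x$ after $n$ observations and $\rho'_n(x)$ its recoding probability (the probability assigned to $x$ immediately after observing one more occurrence of it), then
\[
\hat N_n(x) \;=\; \frac{\rho_n(x)\,\bigl(1-\rho'_n(x)\bigr)}{\rho'_n(x)-\rho_n(x)},
\qquad
r^{+}(x) \;=\; \beta\,\bigl(\hat N_n(x)+\epsilon\bigr)^{-1/2},
\]
where the exploration bonus $r^{+}$ is one typical choice; the constants $\beta$ and $\epsilon$ are hyperparameters, and the exact bonus used in the paper may differ.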
In this article we introduce the Arcade Learning Environment (ALE): both a challenge problem and a platform and methodology for evaluating the development of general, domain-independent AI technology. ALE provides an interface to hundreds of Atari 2600 game environments, each one different, interesting, and designed to be a challenge for human players. ALE presents significant research challenges for reinforcement learning, model learning, model-based planning, imitation learning, transfer learning, and intrinsic motivation. Most importantly, it provides a rigorous testbed for evaluating and comparing approaches to these problems. We illustrate the promise of ALE by developing and benchmarking domain-independent agents designed using well-established AI techniques for both reinforcement learning and planning. In doing so, we also propose an evaluation methodology made possible by ALE, reporting empirical results on over 55 different games. All of the software, including the benchmark agents, is publicly available.
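A minimal interaction loop with ALE's Python bindings looks roughly as follows; the API names (ALEInterface, loadROM, getLegalActionSet, act, game_over, reset_game) reflect my understanding of the ale-py bindings and may differ across versions, and the ROM path is a placeholder.

```python
import random
from ale_py import ALEInterface

ale = ALEInterface()
ale.loadROM("roms/breakout.bin")        # placeholder path to an Atari 2600 ROM file
actions = ale.getLegalActionSet()       # the game's legal action set

ale.reset_game()
episode_return = 0.0
while not ale.game_over():
    a = random.choice(actions)          # a uniformly random agent, as a trivial baseline
    episode_return += ale.act(a)        # act() advances the emulator and returns the reward
print("episode return:", episode_return)
```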
This paper develops a Bayesian graphical model for fusing disparate types of count data. The motivating application is the study of bacterial communities characterized by diverse high-dimensional features collected under different treatments. In such datasets there is no explicit correspondence between the communities, each of which corresponds to a different factor, making data fusion challenging. We introduce a flexible multinomial-Gaussian generative model for jointly modeling such count data. This latent variable model characterizes the observed data through a shared multivariate Gaussian latent space that parameterizes the set of multinomial probabilities of the transcriptome counts. The covariance matrix of the latent variables induces a covariance structure of co-dependencies among all the transcripts, effectively fusing multiple data sources. We present a computationally scalable variational expectation-maximization (EM) algorithm for inferring the latent variables and the parameters of the model. The inferred latent variables provide a common dimensionality reduction for visualizing the data, and the inferred parameters provide a predictive posterior distribution. In addition to simulation studies demonstrating the variational procedure, we apply the model to a bacterial microbiome dataset.
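A minimal sketch of a multinomial-Gaussian (logistic-normal) generative model of the kind described above is given below; the softmax link, the dimensions, and the parameter names are illustrative assumptions rather than the paper's exact specification.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 20                       # number of count categories (e.g. transcripts); illustrative
mu = np.zeros(D)             # latent Gaussian mean
Sigma = np.eye(D)            # latent covariance: induces co-dependence across categories

def sample_counts(n_total):
    """Draw one count vector: Gaussian latent -> softmax probabilities -> multinomial."""
    z = rng.multivariate_normal(mu, Sigma)
    p = np.exp(z - z.max())
    p /= p.sum()
    return rng.multinomial(n_total, p)

x = sample_counts(1000)      # one simulated sample with 1000 total counts
```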
Gaussian process (GP) regression is a flexible, nonparametric approach to regression that naturally quantifies uncertainty. In many applications, the numbers of responses and covariates are both large, and the goal is to select the covariates that are related to the response. In this setting we propose a novel, scalable algorithm, coined VGPR, which optimizes a penalized GP log-likelihood based on the Vecchia GP approximation, an ordered conditional approximation from spatial statistics that implies a sparse Cholesky factor of the precision matrix. We traverse the regularization path from strong to weak penalization, sequentially adding candidate covariates based on the gradient of the log-likelihood and discarding irrelevant covariates via a new quadratic constrained coordinate descent algorithm. We propose Vecchia-based mini-batch subsampling, which provides unbiased gradient estimators. The resulting procedure is scalable to millions of responses and thousands of covariates. Theoretical analysis and numerical studies demonstrate the improved scalability and accuracy relative to existing methods.
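For context, the Vecchia approximation mentioned above replaces the joint GP likelihood with a product of ordered conditionals, each conditioning on a small set of previously ordered responses, which is what yields a sparse Cholesky factor of the precision matrix:
\[
\log p(\mathbf{y}\mid\theta)\;\approx\;\ell_{\mathrm{Vecchia}}(\theta)\;=\;\sum_{i=1}^{n}\log p\!\left(y_i \,\middle|\, \mathbf{y}_{c(i)},\theta\right),
\qquad c(i)\subset\{1,\dots,i-1\},\;\; |c(i)|\le m.
\]
VGPR then maximizes this approximate log-likelihood minus a penalty on per-covariate relevance parameters while traversing the regularization path; the specific form of the penalty is not reproduced here.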